Maintenance mode stacking support #3044
Hi @junkaixue, @GrantPSpencer, @zpinto, @xyuanlu
9cb7f35 to
b6a081b
Compare
| return false; | ||
| } | ||
|
|
||
| return signal.hasMaintenanceReasons(); |
This can be dangerous. What if the new version reads an old ZNode?
Updated the code.
I highly recommend that a new contributor start by stabilizing the tests instead of touching the very core part. It is very, very dangerous. A one-line log change has brought down the entire server before. If you still believe your change is solid, we can help review. At the same time, please never lower the bar.
Hey @junkaixue, yes, I believe the current version of the change is solid and in good shape. I have tested the change thoroughly with multiple test cases.
proud-parselmouth
left a comment
Thanks for this effort; this is complicated logic. I think you have covered all the scenarios, keep up the good work.
Can you PTAL at my review comments? It seems like we can refactor this code more.
| * @param reason | ||
| * @param customFields user-specified KV mappings to be stored in the ZNode | ||
| */ | ||
| void automationEnableMaintenanceMode(String clusterName, boolean enabled, String reason, |
There are already three methods with similar names: enableMM, autoEnableMM, manuallyEnableMM, and now automationEnableMaintenanceMode.
I have multiple queries here:
- Is there a reason why we are not overloading the existing name autoEnableMM with the new method?
- IMO, there should be only one method, enableMM, with different triggering entities. Should we create an issue in Apache Helix as a TODO for this?
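To make the single-method suggestion concrete, here is a minimal Java sketch. It is not the Helix API: the `TriggeringEntity` values come from this PR, but the `MaintenanceApiSketch` class, the `enableMaintenanceMode` signature, and the in-memory map standing in for the ZNode are assumptions made purely for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch: one method parameterized by the triggering entity,
// instead of one method per entity. Not the actual Helix interface.
final class MaintenanceApiSketch {
  enum TriggeringEntity { USER, CONTROLLER, AUTOMATION }

  // In-memory stand-in for the maintenance signal ZNode (assumption).
  private final Map<TriggeringEntity, String> reasons =
      new EnumMap<>(TriggeringEntity.class);

  /** Single entry point: the entity is an argument, not part of the method name. */
  void enableMaintenanceMode(boolean enabled, String reason, TriggeringEntity entity) {
    if (enabled) {
      // Each entity owns at most one reason; a repeated call overwrites it.
      reasons.put(entity, reason);
    } else {
      // Exiting only removes the caller's own reason.
      reasons.remove(entity);
    }
  }

  /** The cluster stays in maintenance while any entity still holds a reason. */
  boolean isInMaintenanceMode() {
    return !reasons.isEmpty();
  }
}
```

With this shape, a new actor would only need a new enum value rather than another similarly named method.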
| // The cluster is in maintenance mode if the maintenance signal ZNode exists | ||
| // This includes cases where old clients have wiped listField data but simpleFields remain | ||
| // cluster should remain in maintenance mode as long as ZNode exists | ||
| return signal.hasMaintenanceReasons() || |
Should we remove the check for the empty string, as this may break backward compatibility?
| logger.info("Entity {} doesn't have a maintenance reason entry, exit request ignored", triggeringEntity); | ||
| } | ||
| } | ||
| } else { |
This else shouldn't be needed. Do an early check after if (!enabled) and exit from the method if the maintenance signal is null.
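A rough Java sketch of the early-exit shape being suggested. The `Signal` class, the `exitMaintenance` method, and the returned status strings are hypothetical stand-ins, not code from this PR.

```java
import java.util.ArrayList;
import java.util.List;

final class EarlyExitSketch {
  // Hypothetical stand-in for the maintenance signal; null means no ZNode exists.
  static final class Signal {
    final List<String> entities = new ArrayList<>();
  }

  /** Returns a short description of what happened, for illustration only. */
  static String exitMaintenance(boolean enabled, Signal signal, String entity) {
    if (enabled) {
      return "enter path (omitted)";
    }
    // Early check: nothing to exit when there is no maintenance signal.
    if (signal == null) {
      return "no signal, nothing to do";
    }
    // From here on the signal is known to be non-null, so no else branch is needed.
    if (!signal.entities.remove(entity)) {
      return "exit request ignored";
    }
    return "reason removed";
  }
}
```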
| * @return true if a reason was removed, false otherwise | ||
| */ | ||
| public boolean removeMaintenanceReason(TriggeringEntity triggeringEntity) { | ||
| LOG.info("Removing maintenance reason for entity: {}", triggeringEntity); |
This method needs to be refactored.
- Get the maintenance reasons: reasons = getMaintenanceReasons()
- Get the filtered reasons: filteredReasons = filterReasons(reasons, null, triggeringEntity). Write a method that takes an includeEntities list and an excludeEntities list.
- Return false early if filteredReasons.size() == reasons.size(), since nothing was removed.
- In the list field reasons we always add reasons at the end, hence the filteredReasons array should always be sorted.
- Always set/reset the simpleFields if filteredReasons.size() != 0.
- Return true.
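The steps above could look roughly like the following Java sketch. Everything here is illustrative: the `ReasonFilterSketch` class, the `TRIGGERED_BY` and `REASON` key names, the `addReason` helper, and the in-memory list and map that stand in for the ZNode's list field and simpleFields are all assumptions, and clearing simpleFields when no reason remains is a guess at behavior the list does not specify.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ReasonFilterSketch {
  enum TriggeringEntity { USER, CONTROLLER, AUTOMATION }

  // Key names are assumptions; the real ZNode keys may differ.
  static final String ENTITY = "TRIGGERED_BY";
  static final String REASON = "REASON";

  // In-memory stand-ins for the ZNode list field and simpleFields.
  final List<Map<String, String>> reasons = new ArrayList<>();
  final Map<String, String> simpleFields = new HashMap<>();

  /** Keeps entries whose entity is in include (null = all) and not in exclude (null = none). */
  static List<Map<String, String>> filterReasons(List<Map<String, String>> all,
      Set<TriggeringEntity> include, Set<TriggeringEntity> exclude) {
    List<Map<String, String>> out = new ArrayList<>();
    for (Map<String, String> r : all) {
      TriggeringEntity e = TriggeringEntity.valueOf(r.get(ENTITY));
      boolean included = include == null || include.contains(e);
      boolean excluded = exclude != null && exclude.contains(e);
      if (included && !excluded) {
        out.add(r);
      }
    }
    return out;
  }

  /** Hypothetical helper: appends a reason and mirrors it into simpleFields. */
  void addReason(TriggeringEntity entity, String reason) {
    Map<String, String> entry = new HashMap<>();
    entry.put(ENTITY, entity.name());
    entry.put(REASON, reason);
    reasons.add(entry);
    simpleFields.clear();
    simpleFields.putAll(entry);
  }

  boolean removeMaintenanceReason(TriggeringEntity entity) {
    List<Map<String, String>> filtered = filterReasons(reasons, null, Set.of(entity));
    if (filtered.size() == reasons.size()) {
      return false; // no entry belonged to this entity
    }
    reasons.clear();
    reasons.addAll(filtered);
    simpleFields.clear(); // assumption: nothing left means nothing to mirror
    if (!filtered.isEmpty()) {
      // Entries are appended in order, so the last one is the most recent.
      simpleFields.putAll(filtered.get(filtered.size() - 1));
    }
    return true;
  }
}
```

The same `filterReasons` helper, called with an include set instead of an exclude set, would cover the lookup-style methods questioned further down in this review.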
| // The triggering entity is our unique key - Overwrite any existing entry with this entity | ||
| String triggerEntityStr = triggeringEntity.name(); | ||
|
|
||
| List<Map<String, String>> reasons = getMaintenanceReasons(); |
Here you filter out the reasons for the given triggeringEntity and then always add the reason as a new entry at the end of the reasons list.
| * @param triggeringEntity The entity to check | ||
| * @return true if there is a maintenance reason from this entity | ||
| */ | ||
| public boolean hasMaintenanceReason(TriggeringEntity triggeringEntity) { |
Can we use filterReasons instead of this method? Either the caller or this method can add a check on size.
Again, I don't see an explicit need for this method.
| * @param triggeringEntity The entity to get reason details for | ||
| * @return Map containing reason details, or null if not found | ||
| */ | ||
| public Map<String, String> getMaintenanceReasonDetails(TriggeringEntity triggeringEntity) { |
This seems similar to filterReasons. Why is it public?
| * | ||
| * @return The count of active maintenance reasons | ||
| */ | ||
| public int getMaintenanceReasonsCount() { |
I don't think we need this.
| * @param triggeringEntity The entity to get reason for | ||
| * @return The reason string, or null if not found | ||
| */ | ||
| public String getMaintenanceReason(TriggeringEntity triggeringEntity) { |
Where is this called?
Can we instead do this at the caller:
reasons = filterReasons(getMaintenanceReasons(), List.of(triggeringEntity), null)
String mr = reasons.size() != 0 ? reasons.get(0).getOrDefault(REASON, null) : null
|
|
||
| // Only reconcile USER data from legacy clients | ||
| // CONTROLLER and AUTOMATION should not have legacy data loss scenarios | ||
| if (simpleReason != null && !simpleReason.isEmpty() && simpleEntity == TriggeringEntity.USER |
This might be more readable if we return early, like this:
reasons = getMaintenanceReasons()
if (simpleReason == null || simpleReason.isEmpty() || filterReasons(reasons, TriggeringEntity.USER).size() > 0) {
return
}
... rest of the logic.
4e9cf85 to
4f829df
Compare
Issues
(#200 - Link your issue number here: You can write "Fixes #XXX". Please use the proper keyword so that the issue gets closed automatically. See https://docs.github.com/en/github/managing-your-work-on-github/linking-a-pull-request-to-an-issue
Any of the following keywords can be used: close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved)
This PR closes #3041 and adds support for maintenance mode stacking.
Description
(Write a concise description including what, why, how)
The current implementation of maintenance mode for clusters supports only a single reason at a time, tracked using the simpleFields.REASON key. This restricts the functionality to a single actor and reason, which limits flexibility and coordination.
This proposal introduces a new design that allows multiple actors to independently place a cluster into maintenance mode for different reasons. We will extend the maintenance mode design to support multiple actors, each capable of independently adding or removing their own maintenance reason. The cluster will remain in maintenance mode as long as at least one active reason exists. Each reason will be associated with metadata such as the actor, reason, and timestamp. For backwards compatibility, the existing simpleFields.REASON will be retained and updated to reflect the most recent active reason. If a reason is removed, it will be replaced with the next most recent one. While legacy clients that remove the entire znode cannot be completely prevented, we will handle such cases gracefully and recommend migrating to an updated API that enables proper multi-actor maintenance handling.
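As a rough illustration of the stacking behavior described above (not the code in this PR), the following Java sketch keeps one entry per actor and derives the legacy reason from the most recent entry. The `MaintenanceStackSketch` class, its method names, and the use of a plain in-memory map instead of a ZNode are assumptions.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class MaintenanceStackSketch {
  // One entry per actor: the reason text plus the time it was recorded.
  static final class Entry {
    final String actor;
    final String reason;
    final long timestamp;

    Entry(String actor, String reason, long timestamp) {
      this.actor = actor;
      this.reason = reason;
      this.timestamp = timestamp;
    }
  }

  private final Map<String, Entry> byActor = new HashMap<>();

  /** An actor adds (or overwrites) its own reason. */
  void enter(String actor, String reason, long timestamp) {
    byActor.put(actor, new Entry(actor, reason, timestamp));
  }

  /** An actor removes only its own reason; returns false if it had none. */
  boolean exit(String actor) {
    return byActor.remove(actor) != null;
  }

  /** The cluster stays in maintenance mode while at least one reason is active. */
  boolean inMaintenance() {
    return !byActor.isEmpty();
  }

  /** What a legacy reader of simpleFields.REASON would see: the most recent active reason. */
  Optional<String> legacyReason() {
    return byActor.values().stream()
        .max(Comparator.comparingLong((Entry e) -> e.timestamp))
        .map(e -> e.reason);
  }
}
```

When the most recent reason is removed, `legacyReason()` falls back to the next most recent one, which mirrors the backward-compatibility behavior described for simpleFields.REASON.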
Tests
testAutomationMaintenanceMode, testRemoveMaintenanceReasonNoDuplicates, testLegacyClientCompatibility, testMaintenanceHistoryAfterOperationFlag, testMultiActorMaintenanceModeExitSequence, testMultiActorMaintenanceModeReconciliation, testMultiActorMaintenanceModeOldClientExit, testMultiActorMaintenanceModeOldClientOverride, testMultiActorMaintenanceModeInvalidExit
(List the names of added unit/integration tests)
(If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)
Changes that Break Backward Compatibility (Optional)
(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)
Documentation (Optional)
(Link the GitHub wiki you added)
Commits
Code Quality
(helix-style-intellij.xml if IntelliJ IDE is used)